
NBSVM for sentiment and topic classification

2018-11-23

Paper Summary

  1. Including word bigram features gives consistent gains on sentiment analysis tasks;
  2. NB does better than SVMs on short snippet sentiment tasks (the opposite holds on longer documents);
  3. A novel variant: an SVM that uses NB log-count ratios as features;
  4. MNB is usually better and more stable than multivariate Bernoulli NB, and binarized MNB is better than standard MNB.

Model

Main model variants

Linear classifiers: $$y^{(k)}=sign(\boldsymbol{w}^T\boldsymbol{x}^{(k)}+b)$$

$\boldsymbol{f}^{(i)}\in R^{|V|}$ is the feature count vector for training case $i$ with label $y^{(i)}\in\{-1,1\}$, where $V$ is the set of features and $f_j^{(i)}$ is the number of occurrences of feature $V_j$ in training case $i$. Define $$\boldsymbol{p}=\alpha + \sum_{i:y^{(i)}=1}\boldsymbol{f}^{(i)}$$ and $$\boldsymbol{q}=\alpha + \sum_{i:y^{(i)}=-1}\boldsymbol{f}^{(i)}$$ for smoothing parameter $\alpha$. The log-count ratio is:

$$\boldsymbol{r} = \log\left(\frac{\boldsymbol{p}/||\boldsymbol{p}||_1}{\boldsymbol{q}/||\boldsymbol{q}||_1}\right)$$
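A minimal NumPy sketch of these count vectors and the log-count ratio (the toy counts, labels, and variable names are my own, not from the paper):

```python
import numpy as np

# Toy feature count matrix: 4 training cases, 3 features,
# with labels y in {-1, +1} (made-up numbers for illustration).
f = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 2]])
y = np.array([1, 1, -1, -1])
alpha = 1.0  # smoothing parameter

p = alpha + f[y == 1].sum(axis=0)   # counts in the positive class
q = alpha + f[y == -1].sum(axis=0)  # counts in the negative class

# Log-count ratio r = log((p / ||p||_1) / (q / ||q||_1))
r = np.log((p / p.sum()) / (q / q.sum()))
```

A positive entry of `r` marks a feature that is relatively more frequent in the positive class, a negative entry the opposite.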


Multinomial Naive Bayes

Adapted and expanded from the Naive Bayes article on Wikipedia.

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial distribution $(p_1,\ldots,p_n)$, where $p_i$ is the probability that event $i$ occurs. In our case, if we treat each document as a bag of words, the feature vector of a sample sentence is the frequency of each vocabulary word in it, which gives:

$$p(\boldsymbol{x}|C_k)=\frac{(\sum_ix_i)!}{\prod_i x_i!}\prod_i p_{ki}^{x_i}$$

Since this is a product of probabilities, applying the log makes it linear:

$$\log p(C_k|\boldsymbol{x})\propto \log\left(p(C_k)\prod_{i=1}^np_{ki}^{x_i}\right)= \log p(C_k)+\sum_{i=1}^{n}x_i\log p_{ki}=b + \boldsymbol{w}_k^T\boldsymbol{x}$$

with $b=\log p(C_k)$ and $w_{ki} = \log p_{ki}$.

Smoothing is needed because if a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero.
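A small sketch of multinomial NB in the log-linear form above, with Laplace (add-one) smoothing so no class–feature pair gets zero probability (toy counts and variable names are my own):

```python
import numpy as np

# Toy word-count matrix: 4 documents over a 3-word vocabulary
# (counts and labels are made up for illustration).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 2, 2],
              [1, 0, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing: avoids zero-probability estimates

log_prior, log_pki = {}, {}
for k in (0, 1):
    counts = X[y == k].sum(axis=0) + alpha
    log_pki[k] = np.log(counts / counts.sum())  # log p_ki
    log_prior[k] = np.log((y == k).mean())      # log p(C_k)

def score(x, k):
    # b + w_k^T x, with b = log p(C_k) and w_ki = log p_ki
    return log_prior[k] + x @ log_pki[k]

x_new = np.array([2, 0, 1])
pred = max((0, 1), key=lambda k: score(x_new, k))
```

Classification picks the class with the larger linear score, exactly the $b + \boldsymbol{w}_k^T\boldsymbol{x}$ form derived above.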


Application in Quora Insincere Question Classification

Theory Explanation

We are given 1,306,122 labeled training question texts and 56,370 question texts to be predicted; each question is classified as either Insincere or Sincere.


To begin with, we want the probability of each class for a given sentence, $P(Class|Sentence)$. Since we only have two classes, 0 and 1, we can determine the class by taking their ratio:

$$result=\frac{P(C=1|S)}{P(C=0|S)}$$

If it is bigger than 1, the sentence should be classified as 1 (insincere); otherwise 0.



The problem is how to get $P(C|S)$.
According to Bayes' theorem,

$$P(C=1|S)=\frac{P(S|C=1)P(C=1)}{P(S)},P(C=0|S)=\frac{P(S|C=0)P(C=0)}{P(S)}$$

thus,
$$result=\frac{P(S|C=1)}{P(S|C=0)}\frac{P(C=1)}{P(C=0)}$$

$\frac{P(C=1)}{P(C=0)}$ here is a constant: the ratio of the counts of questions labeled 1 and 0.

For $P(S|C=1)$, the Naive Bayes assumption is that every word appears independently. With this assumption, we can write

$$P(S|C=1) = P(w_1|C=1)P(w_2|C=1)\cdots P(w_n|C=1)$$
where $w_i$ is the $i$-th word in sentence $S$.


Thus, all we need are $P(w|C=1)$ and $P(w|C=0)$ for every word in the bag. For each $P(S|C)$, we simply multiply the word probabilities together.

Therefore,
$$result=\frac{\prod_{i=1}^nP(w_i|C=1)}{\prod_{i=1}^nP(w_i|C=0)}\frac{P(C=1)}{P(C=0)}$$
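The ratio test can be checked on a tiny example; the words, per-class probabilities, and prior ratio below are invented purely for illustration:

```python
# Hypothetical per-class word probabilities (not real estimates)
p_w_given_1 = {'you': 0.2, 'stupid': 0.3}    # P(w|C=1), insincere
p_w_given_0 = {'you': 0.25, 'stupid': 0.01}  # P(w|C=0), sincere
prior_ratio = 0.06 / 0.94                    # P(C=1)/P(C=0), imbalanced

sentence = ['you', 'stupid']
result = prior_ratio
for w in sentence:
    result *= p_w_given_1[w] / p_w_given_0[w]

label = 1 if result > 1 else 0
```

Even with a small prior ratio, a word that is far more likely under class 1 (here `'stupid'`) can push the result above 1.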

Define the log ratio per word by taking the log of the result:
$$r =\log\frac{\text{ratio of word } w \text{ in class } 1}{\text{ratio of word } w \text{ in class } 0}= \log\frac{\boldsymbol{p}/||\boldsymbol{p}||}{\boldsymbol{q}/||\boldsymbol{q}||}$$

$\boldsymbol{p}$ here is obtained by summing the feature (word count) vectors of every row (sentence) that belongs to class 1, with the smoothing term $\alpha$ added to handle words that never appear in a particular class: $\boldsymbol{p}=\alpha + \sum_{i:y^{(i)}=1}\boldsymbol{f}^{(i)}$. $||\boldsymbol{p}||$ is the normalization term: the number of class-1 sentences plus $\alpha$.

Brief Implementation Explanation

Here’s the Detailed Notebook Link with an explanation of how to implement the NBLR in this Kaggle competition.

In Python, where `x` is the feature matrix and `y` is the target, $\boldsymbol{p}$ equals `alpha + x[y==1].sum(0)` and $||\boldsymbol{p}||$ is `alpha + (y==1).sum()`.

We can obtain the TF-IDF matrix by:

import re, string
from sklearn.feature_extraction.text import TfidfVectorizer

# Split on punctuation, including some common unicode marks
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s): return re_tok.sub(r' \1 ', s).split()

vec = TfidfVectorizer(ngram_range=(1, 2), tokenizer=tokenize,
                      min_df=3, max_df=0.9, strip_accents='unicode',
                      use_idf=1, smooth_idf=1, sublinear_tf=1)

tr_term = vec.fit_transform(X_train)
va_term = vec.transform(X_val)
te_term = vec.transform(test_df['question_text'])

Then build the model as in the paper described above:

import numpy as np
from sklearn.linear_model import LogisticRegression

def pr(y_i, y, alpha=1):
    # Smoothed ratio of feature counts for class y_i
    p = x[y == y_i].sum(0) + alpha
    p_norm = (y == y_i).sum() + alpha
    return p / p_norm

def nblr(y):
    y = y.values
    r = np.log(pr(1, y) / pr(0, y))  # NB log-count ratio
    m = LogisticRegression(C=4, dual=True, solver='liblinear', max_iter=500)
    x_nb = x.multiply(r)             # scale features by r
    return m.fit(x_nb, y), r
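For prediction, the validation features must be scaled by the same $\boldsymbol{r}$ before calling the fitted model. A self-contained sketch of the whole cycle on synthetic sparse data (random counts standing in for the competition's TF-IDF matrices):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Synthetic sparse "term" matrix standing in for tr_term
rng = np.random.default_rng(0)
x = csr_matrix(rng.poisson(0.3, size=(200, 50)).astype(float))
y = rng.integers(0, 2, 200)

alpha = 1.0
p = (x[y == 1].sum(0) + alpha) / ((y == 1).sum() + alpha)
q = (x[y == 0].sum(0) + alpha) / ((y == 0).sum() + alpha)
r = np.log(p / q)  # NB log-count ratio, shape (1, n_features)

m = LogisticRegression(C=4, dual=True, solver='liblinear')
m.fit(x.multiply(r), y)

# Scale held-out features by the same r before predicting
x_val = csr_matrix(rng.poisson(0.3, size=(50, 50)).astype(float))
preds = m.predict(x_val.multiply(r))
```

Forgetting to apply `r` at prediction time is an easy mistake, since the model was trained in the NB-scaled feature space.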

At last, we get a 0.57157 F1 score for this simple baseline. There is a lot more that can be done to improve it:

  1. The distribution of positive and negative samples is extremely unbalanced; we can try downsampling or data augmentation.
  2. Text preprocessing, such as stop-word removal, has not been done.
  3. Other linear classifiers, such as SVM, can be applied.


Credit

Info: 2012, ACL, Paper Link
There’s an explanation about nbsvm here: Kaggle notebook by Jeremy Howard
Some details about its L1 norm: detailed explanation by Zhangyang
